Task 2: Exploratory Data Analysis on the Titanic Dataset

📦 Import libraries

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load & Preview dataset

In [8]:
df = pd.read_csv(r"C:\Users\Maged\Desktop\TASK 2\titanic\train.csv")

df.head()
Out[8]:

🧹 Data Cleaning

1. Check for missing values

In [45]:
df.isnull().sum()
Out[45]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
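
Raw counts are easier to judge as a share of all rows. With the 891 passengers reported later in this notebook, Cabin is missing in 687 rows (about 77%) and Age in 177 rows (about 20%), which is what motivates dropping one column and imputing the other. A minimal sketch of that calculation follows; the `toy` frame is hypothetical stand-in data, not the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() yields booleans, so mean() is the fraction of missing rows per column.
missing_share = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be applied to `df` before any cleaning.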

2. Fill missing Age with median, Embarked with mode, drop Cabin

In [47]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which pandas flags as chained assignment and which stops working in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df.drop(columns=['Cabin'], inplace=True)

3. Confirm changes

In [49]:
df.isnull().sum()
Out[49]:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
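
The same cleaning step can also be written as one dict-based `fillna` call on the whole frame, which avoids setting values through a column selection. This is a sketch on a hypothetical `toy` frame, with `fill_values` and `cleaned` as illustrative names, not the notebook's own variables.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same three problem columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 40.0],
    "Embarked": ["S", None, "C", "S"],
    "Cabin": [None, "C85", None, None],
})

# One fill value per column: median for Age, most frequent value for Embarked.
fill_values = {
    "Age": toy["Age"].median(),
    "Embarked": toy["Embarked"].mode()[0],
}

# fillna with a dict only touches the listed columns; Cabin is dropped afterwards.
cleaned = toy.fillna(fill_values).drop(columns=["Cabin"])
print(cleaned.isnull().sum())
```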

🧾 Summary Stats & Group Insights

1. Basic statistics

In [32]:
df.describe()
Out[32]:

2. Survival rate by gender

In [17]:
df.groupby('Sex')['Survived'].mean()
Out[17]:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

3. Survival by class

In [23]:
df.groupby('Pclass')['Survived'].mean()
Out[23]:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
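
A group mean says nothing about how many passengers stand behind it, so it can help to report group sizes next to the rates. The sketch below uses named aggregation on a hypothetical `toy` frame; the output column names `passengers`, `survivors`, and `rate` are illustrative choices.

```python
import pandas as pd

# Hypothetical toy data: class label and 0/1 survival flag.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# count = group size, sum = number of survivors, mean = survival rate.
summary = toy.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survivors="sum",
    rate="mean",
)
print(summary)
```

Replacing `"Pclass"` with `"Sex"` would give the equivalent table for the gender split.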

4. Total Passengers, Total Survivors & Survival Rate

In [78]:
total_passengers = df.shape[0]
survivors = df['Survived'].sum()
survival_rate = round((survivors / total_passengers) * 100, 2)

print(f" Total Passengers: {total_passengers}")
print(f" Total Survivors: {survivors}")
print(f" Survival Rate: {survival_rate}%")
 Total Passengers: 891
 Total Survivors: 342
 Survival Rate: 38.38%
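
Because `Survived` is coded as 0 or 1, the overall survival rate is simply the mean of that column. The sketch below rebuilds a 0/1 series from the two totals printed above (342 survivors out of 891) to show that the shortcut gives the same figure; the `survived` series is reconstructed for illustration and is not the original column.

```python
import pandas as pd

# Rebuild a 0/1 flag series from the reported totals: 342 survivors, 549 non-survivors.
survived = pd.Series([1] * 342 + [0] * 549)

# The mean of a 0/1 column is the share of ones.
rate_pct = round(survived.mean() * 100, 2)
print(len(survived), int(survived.sum()), rate_pct)
```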

📊 Visualizations

In [57]:
# Set plot style (set_theme is the current name for sns.set)
sns.set_theme(style="darkgrid")

1. Count of survivors by sex

In [59]:
sns.countplot(x='Sex', hue='Survived', data=df)
plt.title('Survival Count by Gender')
plt.show()
[Figure: count plot titled 'Survival Count by Gender']

2. Survival by passenger class

In [63]:
sns.countplot(x='Pclass', hue='Survived', data=df)
plt.title('Survival Count by Passenger Class')
plt.show()
[Figure: count plot titled 'Survival Count by Passenger Class']

3. Age distribution by survival

In [66]:
plt.figure(figsize=(10, 6))
# `shade` is deprecated in seaborn; `fill=True` is the replacement.
sns.kdeplot(df.loc[df['Survived'] == 1, 'Age'], label='Survived', fill=True)
sns.kdeplot(df.loc[df['Survived'] == 0, 'Age'], label='Did Not Survive', fill=True)
plt.title('Age Distribution by Survival')
plt.legend()
plt.show()
[Figure: KDE plot titled 'Age Distribution by Survival']

4. Heatmap of correlations

In [71]:
# Select only numeric columns for correlation
numeric_df = df.select_dtypes(include='number')

# Now generate the correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(numeric_df.corr(), annot=True, cmap='Blues')
plt.title('Correlation Heatmap of Numeric Features')
plt.show()
[Figure: heatmap titled 'Correlation Heatmap of Numeric Features']
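
For a survival analysis, the column of the correlation matrix that matters most is the one for `Survived`. A small sketch of how that column could be pulled out and ranked by absolute strength is shown here, on a hypothetical `toy` frame; `corr_with_target` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column to show it is excluded.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
    "Fare":     [7.25, 71.28, 53.10, 8.05, 13.00, 8.46],
    "Name":     ["a", "b", "c", "d", "e", "f"],
})

numeric = toy.select_dtypes(include="number")

# Take the Survived column of the matrix, drop its self-correlation,
# and order the remaining features by absolute correlation.
corr_with_target = (
    numeric.corr()["Survived"]
    .drop("Survived")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a weak value here does not rule out a relationship with survival.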

5. Age histogram by survival (stacked)

In [89]:
sns.histplot(data=df, x='Age', hue='Survived', kde=True, multiple='stack')
plt.title('Age Distribution by Survival')
plt.show()
[Figure: stacked histogram with KDE titled 'Age Distribution by Survival']

SUMMARY

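- Females had a much higher survival rate than males (74.2% vs. 18.9%).

- Survival rate fell with passenger class: 63.0% in 1st class, 47.3% in 2nd, and 24.2% in 3rd.

- Overall, 342 of 891 passengers survived (38.38%).

- Passengers aged 20–40 had mixed survival rates. The 177 ages imputed with the median all share one value, which can distort the age plots.

- Passengers who embarked at port 'C' had better survival rates, although no cell above computes survival by port.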
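
The last two points rest on group comparisons that could be checked with a `groupby` on port and on age bands. The sketch below shows one way to do so on a hypothetical `toy` frame; the band edges (20, 40, 80) and labels are assumptions chosen to match the wording of the summary, and the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
toy = pd.DataFrame({
    "Age":      [4, 19, 25, 31, 38, 45, 62, 28],
    "Embarked": ["S", "C", "S", "C", "Q", "S", "C", "S"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Survival rate per port of embarkation.
by_port = toy.groupby("Embarked")["Survived"].mean()

# Bin ages into bands (right-inclusive by default), then average survival per band.
age_band = pd.cut(toy["Age"], bins=[0, 20, 40, 80], labels=["<=20", "20-40", ">40"])
by_age = toy.groupby(age_band, observed=True)["Survived"].mean()

print(by_port)
print(by_age)
```

Running the same two `groupby` calls on the cleaned `df` would put numbers behind both statements.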